MaxEnt power spectrum estimation using the Fourier 
transform for irregularly sampled data applied to a record 
of stellar luminosity 
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Abstract The principle of maximum entropy is ap- 
pUed to the spectral analysis of a data signal with gen- 
eral variance matrix and containing gaps in the record. 
The role of the entropic regularizer is to prevent one 
from overestimating structure in the spectrum when 
faced with imperfect data. Several arguments are pre- 
sented suggesting that the arbitrary prefactor should 
not be introduced to the entropy term. The introduc- 
tion of that factor is not required when a continuous 
Poisson distribution is used for the amplitude coeffi- 
cients. We compare the formalism for when the vari- 
ance of the data is known explicitly to that for when 
the variance is known only to lie in some finite range. 
The result of including the entropic measure factor is 
to suggest a spectrum consistent with the variance of 
the data which has less structure than that given by the 
forward transform. An application of the methodology 
to example data is demonstrated. 

Keywords Fourier transform - Power spectral density 
- Irregular sampling - Maximum entropy data analysis 



1 Introduction 

The analysis of imperfect data is a common task in sci- 
ence. Given a set of measurements sampled over time, 
one commonly uses the Fourier transform to estimate 
the power carried by the signal as a function of fre- 
quency. The forward transform is commonly viewed as 
the best estimate of the amplitude and phase associated 
with basis functions of independent frequency; however, 
the indiscriminate use of the forward transform is not 
appropriate when the data are known to be subject to 
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measurement error, and the problem of irregular sam- 



subjectivity, such as interpolation L 



Malik et al.ll2005D or zer 



(Cenker et al. 


1991 


(Bovd 19921). 



Bayesian statistical inference is a well-esta blished 
methodology for dealing with im perfect data ((Bretthorst^ 
1988t ISivialll99d: ICregorvllioOsI) . The parameters of in- 



terest, here the amplitude and phase comprising the 
power spectral density, are related to the data through 
a model function which may be nonlinear. When that 
function is invertible, its inverse is usually called the 
forward transform of the data, but the methodology 
applies as well to model functions which are not in- 
vertible. The most likely values for the parameters are 
given by those which maximize their joint distribution, 
which takes into account both their likelihood as mea- 
sured by the discrepancy between the model and the 
data and their possibly non-uniform prior distribution. 
A non-uniform prior commonly represents the invariant 
Haar measure under a change of variables, and when 
the number of parameters exceeds the number of data, 
a prior based on entropic arguments is often employed. 
These methods generally fall under the rubric of Max- 
Ent data analysis, as the optimum of the joint dis- 
tribution minimizes the residual while simultaneously 
maximizing the entropy distribution. Essential ly, we 
are extendin g the Lomb-Scargle method (jLombl 1976 : 
Scarglj 1982 ) to incorporate the effect of the measure- 



ment errors on the estimate of the most likely amplitude 
and phase coefficients. 

After a brief review of Bayesian data analysis, we 
investigate the details of its application to power spec- 
tral density estimation using Fourier basis functions. 
The problem of missing data values for otherwise regu- 
lar sampling is easily addressed by working with basis 
functions defined only at the measurement times. We 
will compare the methodology for the case when the 
variance matrix of the data is known explicitly to that 
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for the more likely occurrence of when the variance is 
known only to lie in some finite range with assumed 
independent measurements. The primary result is that 
the entropic prior flattens the power spectrum relative 
to that produced by the forward transform, which max- 
imizes solely the likelihood distribution. We demon- 
strate an application of the method to a signal derived 
from stellar observation, and we close by discussing re- 
cent ideas for improvement of the method so that the 
variance of the coefficients may also be evaluated. 



2 Bayesian primer 

The mathematical language of Bayesian data analy- 
sis is that of condit i onal proba bility distribution func- 
tions (lDurrettlll993 : ISivial E 



mm- 

We notate "the prob- 



ability of A given B under conditions C" as 
pvohiA\B-C)=p{A\cB) 



Pb 



(1) 



dropping the conditioning statement C when it is un- 
changing, but its presence is always implied. The sum 
and product rules of probability theory give rise to the 
expressions for marginalization and Bayes' theorem, 



P = 



pip"" 



{B} 

pIp^ 



(2) 
(3) 



where marginalization follows from the normalization 
requirement and Bayes' theorem follows from requir- 
ing logical consistency of the joint distribution p"*'^ — 
pB,A^ Translated to data analysis, Bayes' theorem re- 
lates the evidence given data y for the parameters X 
yielding model x = x(X), denoted p^, to the likeli- 
hood of the data given the parameters times the 
prior distribution for the parameters , 



y X 



(4) 



where the constant of proportionality p^ represents the 
chance of measuring the data which in practice is re- 
covered from the normalization requirement of the joint 
distribution. 

The essential feature of Bayesian data analysis 
which takes it beyond simple least-squares fitting is 
the use of a non-uriiform prior in appropriate circum- 
stances (jD'Agostinil 119981 ). The role of the prior is to 
prevent one from overestimating structure in the model 
not supported by imperfect data. A prior which ap- 
pears non-uniform in one's chosen variable generally 
represents a prior which is uniform under a change of 
variable to that with invariant Haar measure. For ex- 
ample, consider fitting a two parameter model for a 



straight line x = 5t-|-atoa set of data (t,y) with 
finite CTj, and CTi = 0. A maximum likelihood analy- 
sis (or least-squares for independent data) with a prior 
uniform on both a and h actually has a preferential 
bias for extreme values of the slope, as seen by trans- 
forming that distribution p'' cx 1 to the variable for the 
angle tan^ — b. That transformation is given by p^ — 
p^ \db/d9\ such that p^ oc 1 + tan^ 9. A prior which in- 
stead is uniform over the angle p^ (x 1 leads to a Cauchy 
distribution for the slope p'' oc (1 -I- These priors 

are compared in Figure [1] panels (a) and (b). Centering 
the abscissa t — t — for io = J2d^d'^d'^ / ^d'^d^ 
yields independent estimates for the slope and inter- 
cept (T^j = 0, and the prior for a is uniform cx 1. 
At to, the intercept a is the estimate of the mean of y, 
through which the line of best fit must pass. 

The effect of a non-uniform prior is to shift the loca- 
tion of the solution with maximum evidence away from 
that given solely by the likelihood factor. Writing the 
merit function F — ~ log p^ as a sum of residual and 
prior terms F = R + P and dropping the term for the 
normalization constant, a non- uniform prior VP ^ 
provides an optimal solution \7F(X.p) = when the 
data have very little to say Vi? « 0. Returning to 
our example, let us consider a set of Nd independent 
measurements with uniform variance and express the 
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Fig. 1 Prior distributions (a) and (b) for the slope 6 — 
tan 9 of a linear fit. The dashed line shows a prior uniform 
in b while the solid line shows a prior uniform in 9. The 
model function xf (solid) has a slope with less magnitude 
than that of x_r (dashed) which decreases as the variance 
increases from (c) to (d) 
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Fig. 2 A regularly sampled signal comprised of four sinusoids of unit amplitude, displayed as squares in (a), has the 
power spectrum shown in (b), with the signal frequencies indicated by circles at the top and the discrete Fourier transform 
periodogram marked by dots with the bin widths as dashed lines, and the phase spectrum is shown in (c) with the values 
of the signal phases circled. When irregularly sampled with the same number of measurements (d), the power spectrum 
(e) is able (but not guaranteed) to resolve the signal components. The phase spectrum (f) is not as clearly structured as 
for regular sampling. The interpolated reconstructions displayed as lines in (a) and (d) reproduce the signal values at the 
measurement times 



residual term R — jla^y for = X^dl^^ ^ Vd)^ ■ As 
the data become less reliable cr^ — > oo, the prior term 
dominates the gradient and the estimate for 6 tends 
to zero, indicating that the data may be characterized 
solely by their mean rather than exhibiting a linear re- 
lation to the abscissa. We compare the model vectors 
Xfl and xp- for a set of three data points at two values 
of (T^ in Figure [1] panels (c) and (d), where one can see 
how the slope diminishes as the variance increases. The 
role of the prior is to prevent one from overestimating 
structure in the model when working with imperfect 
data by taking into account the measure factor for the 
chosen parametrization. In that sense, Bayesian anal- 
ysis provides the most conservative estimate consistent 
with the data. 



3 One-sided discrete CFT 

Let us now briefly review the theory of the continuous 
Fourier transform (CFT) in terms of its discrete appli- 
cation (dCFT). Suppose there exists a signal yu{tu), 
where the subscript u reminds us that the data are 
given in terms of unit bearing quantities, sampled at 



regular intervals t„ = tAt for integer t e [1,-/Vt] and 
defining the unit of time Af = 1, possibly with missing 
values Nd < Nt such that t contains only the mea- 
surement times and y the corresponding values. The 
critical frequency for aliasing is fc = (2At)~^ and re- 
lates to the periodicity of the spectrum on an infinite 
frequency axis. For a real signal j/„ = yuy the am- 
plitude spectrum has conjugate symmetry about zero 
frequency, and so we can restrict consideration to the 
positive frequency axis /„ = /A/ for integer / g [0, Nf] 
and A/ = {2N f At)~^ . The Fourier basis functions may 
be represented as a matrix 0, where 

6/4 = %/2exp(z27r/„i„) = \/2 exp(i7r/i/A^/) , (5) 

so that the forward transform Yf — QftytAt can be 
written as a matrix multiplication Y = &Dty, where 
Dt = Afl is a diagonal matrix whose entries are all 
A(. The factor \/2 accounts for the one-sided nature 
of the transform, representing the response at negative 
frequencies which differs only by conjugation. The sig- 
nal energy defined by the sum of squared data values, 

i?,^^y2^t=y^Ay, (6) 
t 
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Fig. 3 The critical frequency fc for irregular sampling de- 
pends upon the greatest common factor of the measure- 
ment times. In (a) and (c) are histograms of the inter- 
measurement period for two samplings of the same signal, 
and their corresponding power spectra in (b) and (d) are 
symmetric about the critical frequency 



is equal to the spectral energy defined in terms of the 
transform coefficients Ey = Y^DyY, where D/ is a 
diagonal matrix whose entries are Af except for the 
first and last which are A//2. The edge-most pixels 
corresponding to frequencies and Nf have a width 
only half that of the others when evaluating the inte- 



J2f 'A/ ill order 



1 /2 

gral over the frequency axis /p df u 
for the limits to be strictly respected, in contrast to the 
usual conventions for forming a one-sided power spec- 
tral density (psd) from a two-sided Fourier transform 
which gives those pixels the same weighting as the oth- 
ers. The units of the signal energy differ from those 
of physical energy by a factor of the load impedance, 
and the amplitude of the transform coefficients carries 
units of uy = UyAt, as seen from the inverse transform 
y — Re@^DfY, where 0^' = (0*)'^ is the conjugate 
transpose of the basis functions. 

In order for the reconstruction to replicate the data 
(up to round-off errors), one must take a sufficient dis- 
cretization of the frequency axis. For regularly sampled 
data Nd = Nt, one has the requirement Nf > \Nt/2, 
and when Nf is at its minimum the psd of the one-sided 
dCFT corresponds to that of the disc rete Fourier trans- 
form periodogram ([Press et al.lll992l ) up to the factors 
of 2 for the edge-most pixels, as seen in Figure [2] pan- 



els (a) through (c). As Nf increases, the resolution 
along the frequency axis improves so that the remain- 
der of the CFT is evaluated. For irregularly sampled 
data Nd < Nt, one requires Nf ~ Nt/2, but to achieve 
sufficient resolution it is better to take Nf — 2Nt, as 
shown in Figure [2] panels (d) through (f). Virtually all 
cases of irregular sampling will correspond to the miss- 
ing values problem once the greatest common factor of 
the measurement times is identified as At , but if the in- 
dividual measurement durations are not all equal then 
a suitably generalized Dt must be used. To interpolate 
(or extrapolate) the data, one simply replaces t — )• t' 
in the inverse transform. One caveat is that the inverse 
transform is required to agree with the data only at 
the measurement times, so that for increasing resolu- 
tion along the frequency axis the interpolant is driven 
progressively towards the signal mean. 

The identification of fc, which is the lowest positive 
(nonzero) frequency whose basis function is entirely real 
over the measurement times, is confirmed by evaluat- 
ing the psd over the domain G [0, 1], as seen in Fig- 
ure [21 where panels (a) and (c) are histograms of the 
inter-measurement period for two signals and the cor- 
responding spectra in (b) and (d) are symmetric about 
/„ — 1/2, recalling = 1. There is no great mystery 
as to how an irregular sampling can resolve a frequency 
above the Nyquist limit for the corresponding regular 
case, as it is the Nyquist limit which must be defined 
in terms of the irregular sampling. When implement- 
ing the continuous Fourier transform in a discrete set- 
ting, it is the sampling of the signal which induces the 
periodicity in the spectrum, while the finite signal du- 
ration cat ises side-lobes to appear in the point spread 
function (jJohnsonl 120121 ). The location of the critical 
frequency fc must be known so that the normalization 
of the power spectrum can be evaluated over one (half) 
period of the frequency axis. 

While the amplitude or power spectrum usually re- 
ceives more attention from investigators, it is actually 
the phase spectrum which carries most of the informa- 
tion contained in the signal. An amusing example of 
this phenomenon is the demonstration of how to make 
a mandrill look like a girl. Taking the 2D Fourier trans- 
form of two standard test images shown in Figure|4]pan- 
cls (a) and (b) using a stock FFT algorithm, one can 
combine the amplitude spectrum of one image with the 
phase spectrum of the other to produce two new images 
from the inverse transform. The resulting images will 
appear to the eye to resemble the original image asso- 
ciated with the phase used in the combination rather 
than the amplitude. These two combinations of our 
test images are displayed in Figure |4] panels (c) and 
(d), where indeed exchanging the phases has made the 
mandrill look like a girl and vice versa. 
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Y = exp i (|)^ 



Y = exp i 




Y = A exp i (|), 



Y = A- exp i . 



(c) 




Fig. 4 The phase spectra of two images Mandrill (a) and 
Lena (b) may be exchanged to produce two new images (c) 
and (d) which resemble the image associated with the phase 
more than that associated with the amplitude 



4 MaxEnt psd estimation for known data 
variance 

An important feature of Bayesian data analysis is that 
it allows one formally to address imperfect data as well 
as to incorporate a prior contribution to the evidence 
distribution. Rather than assigning errors to the for- 
ward transform coefficients based on the data variance, 
the evidence for a set of coefficients is evaluated from 
the likelihood of measuring the signal given its vari- 
ance, and the prior accounts for the measure factor 
for the chosen parametrization. When th ere are man' 
more parameters than data values [what Isivia ( 199 
calls a "free- form" r nodel], an entropic prior i s often 
used (jSkiUinj Egsi buck and Macaulavl Eooi ): how- 
ever, its usual expression is derived from an approx- 
imation to the Poisson distribution which introduces 
unnecessarily a factor for the conversion of units. That 
factor is often said to be a Lagrange multiplier yet is 
treated as if it were a parameter of the model, which not 
a consistent approach. With the obvious generalization 
of the discrete Poisson distribution to a continuum (de- 
scribed in the Appendix), the prior for the amplitude 
coefficients can be written without reference to an arbi- 
trary unit factor. In this section, we construct the merit 
function in terms of physically relevant quantities and 
show its reduction to the usual form as a consequence of 



normalization. In its physically normalized form, this 
merit function can be related to those used in statistical 
physics and lattice gauge theory. 

4.1 Model, residual, and constraint 

Suppose that we are also given along with y a vari- 
ance matrix V whose entries are the covariance of the 
measurements Vjk = a'^{yj,yk), from which we form 
the normalized weight matrix W = /TrV~^ such 
that TrVF = 1. One then defines the residual signal 
energy Re ^ r^WrNdAt in terms of the residual vec- 
tor r = x — y, the weight matrix W, and the trace 
of the time metric Tr Dt = N^At, {cf. the expression 
for Ey with y ^ r and Dt -> WNdAt). The residual 
vector ultimately is expressed in terms of the model pa- 
rameters for amplitude and phase Xj = exp' 
where the notation exp" (a;) = (expx)" — is similar 
in spirit to its trigonometric counterpart, through the 
model function 



= AfV2 



i (^0 + cos 7rt) 



Nf-l 



(7) 



(8) 



with parameter domains of and ^^V/ € [~oo,oo], 
Af € [0,oo], and (j)f periodic in [— 7r,7r]. The frequen- 
cies of and /c must have a phase of or tt because 
their basis functions are entirely real over t. The as- 
signment of uniform priors p"* cx 1 and p"^ cx 1 to the 
amplitude and phase parameters corresponds to find- 
ing the maximum likelihood solution for the spectrum 
given the data and its variance. 

The likelihood of a signal y with errors described by 
a variance matrix V given its "true" value x is written 
as a multivariate Gaussian v °^ exp~-'^(x^/2), where 
= r-^'V^^r. The subscript E on Re reminds us 
that it carries units equal to the signal energy and must 
be normalized before appearing in the argum ent of an 
expo nential function. In statistical physics (jWannier 
1969h . the normalization of the action appearing in the 
Maxwell distribution is given by the fluctuation energy 
of the system. Here, let us write the normalization as 
Pi/eRe, where Pi/e = ^/kT is the inverse thermal 
energy for some generalized temperature T describing 
the uncertainty to which the measurements are subject. 
Rearranging factors, we have 

P.^eRe - (2/?i/£7VdAt/Try-i)(r^F-ir/2) (9) 
= m, (10) 
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where R = We now argue that /3 = 1 as follows. 

Substituting for the thermal energy, 



NdAt/TrV- 
kT/2 



(11) 



and recalling that kT/2 equals the average fluctuation 
energy per quadratic degree of freedom, we see that the 
numerator and denominator are equal to ct^Aj, where 

is the reciprocal of the average of the eigenvalues of 
V^^. The normalized residual R is seen to be the ratio 
of the residual signal energy given by the discrepancy of 
the model to the thermal energy given by the variance 
of the data. If one were to scale the residual by some 
arbitrary factor /3 ^ 1, that would be tantamount to 
saying that the experimentalists contributing the mea- 
surements have misrepresented the ratio of the units of 
their signal to its deviation. 

By itself, the residual term R is not sufficient to iden- 
tify uniquely a maximum likelihood solution to the op- 
timization problem. The reason is because the model 
function Equation ([7]) is surjective but not injective. 
What that means is that, for a sufficient discretization 
of the frequency axis, there exists a continuous family 
of coefficients {X/?} that can produce a vanishing resid- 
ual i?(Xij) — 0. While the forward transform Y(y) is 
one-to-one, the inverse transform x(X) such that x = y 
is many-to-one; the forward transform coefficients are 
identified uniquely as the member of {X^?} whose spec- 
tral energy equals the signal energy iJx = Ey . In order 
that the maximum likelihood solution should equal the 
forward transform coefficients, the merit function must 
be supplemented with a term enforcing the constraint. 

Using subscripts to indicate which terms are appear- 
ing in the merit function, the negative log likelihood of 
the data, given the constraint on the power spectrum 



(12) 



/ 



is written Fuc = R + AC, where A is a Lagrange 
multiplier enforcing C — 0. The constrained opti- 
mum of R coincides with the unconstrained solution 
of Va.x-Fjjc = 0, which is a saddle point in the space 
(A, X). The presence of the constraint makes the assign- 
ment of errors to the coefficients difficult; not only are 
there correlations between the amplitudes and phases 
but also a restriction on the allowed directions the vari- 
ation in the amplitudes may take. If one amplitude 
increases, some other must decrease so that the nor- 
malization condition is respected. For these reasons, 
the results appearing henceforth for the psd are under- 
stood to be conditioned on satisfaction of the constraint 
C, and its variance in principle is recoverable from the 
Hessian of the merit function but not in a form which 



is practically useful for the chosen parametrization. In 
a later section we will discuss some recent thoughts on 
improvements to the method so that an estimate for 
the variance of the model parameters may be obtained. 

4.2 Entropy and the Poisson distribution 

The ent ropic spectral energy for unnormahzed distri- 
butio ns (|Skillina]|l989[ iBuck and Macaulavlll992t ISivia 
19961) is typically written 



[4 



m/-4log(4/m/)] Af 



(13) 



where the sum over / is understood to take into ac- 
count the frequency metric Df. The factor m/ rep- 
resents the default model, which in this case is a flat 
spectrum nif = m given by the Lebesgue measure with 
the same energy as the signal, m = 2EyAt such that 

1/2 

m Jp dfu — Ey. To evaluate the terms of Se when 
Af ^ 0, one needs to write OlogO = logO° — 0. 

To normalize the Lebesgue measure, one divides the 
entropic spectral energy by the signal energy, letting us 
write the normalized entropy S = Se/ Ey as 

^ = E 14 [1 - log(4/2£;,A0] Af/Ey) - 1 . (14) 
/ 

Under the condition ^/^/ ~ '-'^^ could manip- 
ulate the terms evaluating to a constant, reducing the 
entropy to the familiar expression —Sc = ^fPf logp/ 
for pf = A'j/2NfEyAt and respecting the pixel mea- 
sure, but the numerical optimization is over the entire 
space of X so that only the explicit constant may be 
dropped for satisfactory convergence of the algorithm. 

The quantity of physical relevance is the negentropy, 
which is the difference between the maximum attain- 
able entropy and that of any particular configuration 
and so by definition not negative. The equivalent re- 
quirement for the entropy expression is that it be non- 
positive. For the following examples, let us suppose 
Ey = 1, and consider first a single frequency Nf = 1 
whose pixel spans the Nyquist interval Af — (2At)^^ 
such that m = 2. In Figure [5] panel (a) we compare 
Se and Sc for this case, and we see that they do in- 
deed agree at the location which satisfies the constraint, 
A = 771^/^, where both expressions equal zero. The re- 
duced entropy Sc, however, takes both positive and 
negative values over the unconstrained amplitude axis. 
Replacing the base e of the logarithm with base 2 such 
that Sc equals the Shannon entropy, as shown in panel 
(b) , the expressions again agree at the value of the con- 
straint, but now the corresponding Se as well takes 
on positive values. We are thus led to believe that the 
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Fig. 5 Comparison of the entropy expressions Se (solid) 
and Sc (dashed) for Ey = \ and N j — 1 using logarithm 
base e in (a) and base 2 in (b), with A — \/2 and S — 
indicated by dashed lines 



physical form of the ncgentropy expression is that given 
by Se in Equation (fT3l) normalized by the Lebesgue 
measure and with the natural logarithm. 

Let us now compare the expressions for Se and Sc as 
the number of frequencies spanning the Nyquist inter- 
val increases. In Figure |6] we show those expressions as 
Nf goes from 2 to 5. (We are assuming here a frequency 
grid dual to the one used elsewhere so that there are no 
edge effects.) The constraint C is now satisfied not by 
a single value but by Nf values for A with the required 
sum of squares. The most obvious difference in behav- 
ior is that the maximum of Sc does not remain fixed at 
m^/^ the way that it does for Se- The presence of both 
positive and negative values for Sc has a remarkable 
impact when one follows the mainstream approach of 
multiplying the entropy by a factor a claimed to play 
the role of a Lagrange multiplier in the merit function 
Frco = R+XC—aS for S ^ Sc- Under enforcement of 
the constraint Sc = for nontrivial amplitudes A ^ 0, 
contributions with positive entropy must be balanced 
by contributions with negative entropy, so that the am- 
plitude spectrum is drawn by the residual in either di- 
rection away from its nontrivial value where each contri- 
bution is zero. The effect is to induce a spikiness to the 
spectrum, where relatively few large amplitudes with 
negative entropy (which is unbounded) are balanced by 
many small amplitudes with positive entropy (which is 
bounded). While resolution enhancement is often a de- 
sired goal (which implies consideration of some point 
spread function) , the use of a term aSc is not the cor- 
rect way to go about it. When one uses the normalized 
negentropy Fuca for S = Ss/Ey, the maximum of en- 
tropy coincides with the only values which can satisfy 
the constraint Se — 0, namely the coefficients given by 
the default model Af = m}/'^. 

The use of a as a Lagrange multiplier brings up var- 
ious questions regarding the stopping criterion for the 
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Fig. 6 Comparison of the entropy expressions Se (solid) 
and Sc (dashed) for Ey = 1 and Nf taking the value 2 in 
(a), 3 in (b), 4 in (c), and 5 in (d), with the maximum of 
Se at j4 = \/2 and S = indicated by dashed lines 



algor ithm ( Brvan 1990l : Strauss et al. 1993 : MacKav 
199^. Many practitioners use some variation of the 
method described by Sivia ( 19961 ) in which a is treated 
as a parameter of the model; however, that approach 
is not consistent with the interpretation of a Lagrange 
multiplier whose sole purpose is to equate the norms of 
the gradient vectors for the residual and the constraint. 
The stationary points of the Lagrangian merit function 
describe locations satisfying the constraint where the 
residual changes only in directions which are forbidden. 
Similarly to the remarks concerning /3, the use oi a ^ I 
is tantamount to an arbitrary rescaling of units between 
the entropic and signal energies Se and Ey . By writing 
Fftsc = R — S + AC, the MaxEnt solution Xi? is car- 
ried smoothly from the forward transform coefficients 
^RC = Y for (T^ — >■ to the coefficients with maximum 
entropy given by the Lebesgue measure as the mean 
magnitude of the data variance increases, X^? — !■ Xg^ 



as a 



Re turning to the literature ( Skillinel[l989t Buck and Macaulav^ 
1993), let us reexamine the arguments leading to the 
entropy expression found in Equation (jl3l) . When in- 
voking the hypothetical troop of monkeys, one supposes 
not only that the image, here the power spectrum, is 
comprised of discrete pixels A / but also that the pixel 
values are themselves described discretely in terms of 
some presumably small quantum of image, here power, 
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Fig. 7 Comparison of tiie continuum Poisson distribution 
(solid) witli its Stirling approximation (dashed) for m — 2, 
with the peaks at and indicated by dashed lines and 
the corresponding values of the discrete Poisson distribution 
circled in (a) 



such that Aj = e 71/ for integer n/ , in order to write the 
measure as a product of discrete Poisson distributions 
p(nf\vj:) with parameters Vf = v, taken here to be uni- 
form. Considering a single pixel p", the requirement 
that e be small so that n is large allows the Stirling 
approximation to the factorial, 



(15) 
(16) 



and transforming variables to those for amplitude — 
n such that dn/da = 2a lets one write 



pt = p:\dn/da\, 



(17) 
(18) 



The inverse of the quantum e ^ = a is then identified 



as the prefactor in the expression aSs 



flj log(a^/z^)]A/. The motivation for the introduction 
of a hinges entirely on the discrete form of the Pois- 
son distribution. Its definition here as a unit factor for 
quantization is not consistent with its interpretation 
previously as a Lagrange multiplier. 

We propose that there is no need to introduce a when 
one considers the extension of the Poisson distribution 
to the continuum using the obvious substitution n\ — 
r(n -|- 1), where now n = and the parameter in the 



given units is fi. In the Appendix we discuss the use of 
this expression as a probability density function. For 
a single pixel, the measure of the power distribution is 
given by 



p(A^|/i) = e-V7r(A^ + i) 



(19) 



which if maximized at a value = m has a pa- 
rameter = exp[Ai(m -|- 1)], using the notation 
Afe(r) = {dr)^\ogT{r) for the polygamma functions 
with real argument and integer fc > 0. In terms of 
amplitude, the distribution is 



piA\f,)^\A\p{A^\f,) 



(20) 



which has its peak not at Am but at ~ [i^l"^ . Using 
the notation q = — logp, one can write 



q{A'\^l) 
q(A\^Ji) 



li-A^\ogli + Ko{A^ + l), (21) 
q{A^\ti)~UogA\ (22) 



having dropped a factor of 2. Under the Stirling ap- 
proximation, 

Ao(A2 + 1) « log -A^ + \ log(2^A2) , (23) 
the expressions for q are 



q{A'\^Ji) « ^i-A' + A'\og{Ay^i) + -\ogA\ (24) 



q{A\^^) 



\^ + A^\ 



^,-A' + A'\og{A'/^x) , 



(25) 



having dropped numerical terms. The full expressions 
for p and q are compared for m = 2 in Figure [71 and 
in panel (a) the corresponding values of the discrete 
Poisson distribution are circled. Using Equations (PT|) 
and (j22p . we can now write the prior measure for the 
amplitudes in terms of its contribution P to the merit 
function Frpc = R + P + AC as 

P^J2i^oiA} + l)-llogA}-A}logfi]Af/Ey , (26) 

taking into account the pixel measure D f and dropping 
the constant term proportional to fj,. The parameters 
^0 Siiid An^ have twice the range but only half the 
pixel width, so their prior normalization is consistent 
with the others. 

4.3 Finding the solution 

To solve the optimization problem for F = F^pc, one 
needs the gradient G = VF and the Hessian H = 
V^VF, with V"^ a column vector of derivatives V"^ = 
[c'ajSx]"^ for 9x (a row vector) the covariant gradient 
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in X, written so that matrix multiplication is embodied 
in the notation. The gradient of the residual vector 
9xr = 9x(x — y) is a matrix with a column index for 
the parameters X and a row index for the measurement 
times t. The Hessian operator has the dyadic form 



V^V = 



(9^ 



[dx dx] 



dxdx 



(27) 



and the A dependence is d\F ~ C, dfF = 0, and 
d\dxF = dxC The residual term has contributions 



r^V-^ (Sxr) , 



(28) 



a^axi? = (9xr)' (axr) + (a59xr)(2C)) 



where (c^xC^xr) is contracted along its time index with 
r^V^^. Ordering the parameter vector using the no- 
tation X^ = [Ao,ANf, ^] and letting u/ 
nft/Nf, we can write 



cos (7rt^) 
2 cos ( 0/ — u J 



2Af sin 



in (0/ - uj) 






















-2sin((/)/ -u/) 



_-2sin(0/ - u/) -2A/cos(0/ - u/) 
and for *±(A) ee Ai(^2) ± {2A'^y^ - log/i + A, one has 



E,, 



ylo*+(Ao) 

ANf-^+{ANf) 

2Af^+{Af) - 


2AlA2iAl) + ^_{Ao) 

2^2^^.A2(A^^.)+VI/_(A^J 



4A2A2(A2 







fj^2^^{Af) 




(30) 



(31) 





0" 




(32) 



(;;3) 



/J 



With these evaluations, the solution with maxi 
mum evidence Xi? may be found using commonly 



availa ble numerical optimization routines ([Press et al 
19921) . Let us compare the amplitude spectra for Y 



and Xi? given a signal with unit variance and missing 
values. The simulated signal displayed in Figure [5] (a) 
is comprised of four sinusoids of unit amplitude and 
Gaussian noise of unit variance and has the forward 
transform power spectral density \ Yf\'^ shown in (b) and 
the phase spectrum shown in (c) , where two of the four 
signal components have been well resolved for this par- 
ticular sampling. The effect of including the variance is 
seen in Figure[5](d), where the peaks have been reduced 
to a level not far above the noise, and in (e) , where the 
phases have adjusted slightly. Recalling that the psd is 
proportional to a probability distribution, we see that 



the evidence for the signal components is reduced by a 
factor of nearly 2 compared to their likelihood when the 
signal variance is on the order of the signal magnitude 
squared. Note that the signal energy of the reconstruc- 
tion is generally less than the original signal energy. 
Ex < Ey, indicating that phase cancellations occur in 
the spectrum with maximum evidence. 

The effect of the non-uniform prior for the ampli- 
tudes A f has been to draw the power spectrum towards 
the default model given by a flat spectrum. The amount 
by which the spectrum is flattened depends upon the 
magnitude of the variance of the data. The phase pa- 
rameters 0/ are given a uniform prior so that they are 
free to adjust as needed to bring the model vector x 
as close as possible to the data vector y. In practice, 
the phases are not bound to their principle branch so 
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Fig. 8 The irregularly sampled signal is displayed in (a) as 
dots with error-bars. The maximum likelihood psd in (b) is 
given by the forward dCFT, and its phases are shown in (c). 
The maximum evidence psd in (d) accounts for the variance 
of the data by bringing the psd closer to a flat spectrum, 
and the phases in (e) have varied slightly. The reconstructed 
signal is shown in (a) as a solid line 



that the algorithm does not get hung up on a branch 
cut; they are simply reduced to the principle branch 
after the optimization. The maximum evidence solu- 
tion is more conservative than that given by the for- 
ward transform, in that less structure is assigned to 
the power distribution. Features which persist in the 
spectrum are the most likely to be of significance. 



5 MaxEnt psd estimation for unknown data 
variance 

Let us now look at how the methodology changes when 
the variance of the data is known only to lie in some 
finite range with assumed independent measurements. 
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Fig. 9 The likelihood distribution (solid) for data with 
unknown variance is proportional to the product of two fac- 
tors, r^^" (dashed) and Ar (dash-dot), here normalized by 
r(a) for a = 5 and cr^ G [0.1, 1] 



The expressions for P and AC remain the same, so we 
will focus on the changes to the residual term R. It 
will prove convenient to work with the squared norm 
of the residual vector r^, so that — r^(Vr) and 

V^VrV2 = (Vr)^(Vr) -h r^(V^Vr). The likelihood 
distribution is now expressed as an integral over the 
nuisance parameter for the data deviation cr. 



d-^"-! exp j da oc (r2)-^'^/2Ar 



(34) 



where the integrand includes the Jeffreys prior 1/a 
as well as the normalization of the Gaussian and 
Ar = r{Nd/2,r^/2af) - r{Nd/2,r^ /2a^) in terms 
of the upper incomplete gamma function T{a, z) = 
!T G~^u°-~^du. In the limit of cto and cti — > oo, 
one has Ar — > r(a) which is absorbed by the nor- 
malization such that the minimum of R is given by 

— > 0, but for a finite range a £ [ctojO'i], the di- 
vergence of the factor r"^" is canceled by its recipro- 
cal appearing in the continued fraction expression of 
r(a, z) = exp(— z)2°(z + . . .)^^ so that the likelihood 
is effectively constant for below a certain threshold 
and also reduced significantly for large values of , as 
seen in Figure [HI 

Considering the same signal as in the previous sec- 
tion, let us first suppose that CTq G [1, 10]. The resulting 
MaxEnt psd is shown in Figure [10] (b) , and we see that 
it is nearly identical to that produced by the previ- 
ous analysis which assumed a unit variance. When the 
variance of the data is known only to lie in some finite 
range, the evidence is dominated by the contribution 
from the lower bound on the data variance. Let us now 
suppose an extreme case of (Tf, G [10,100], where the 
variance exceeds the magnitude of the signal. The psd 
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Fig. 10 The MaxEnt psd for unknown data variance in 
the range aa G [1, 10] in (b) and (c) is exceedingly similar 
to the psd for known variance a — 1. When the data is of 
poor quality, at G [10, 100] , the Max;Ent psd in (d) in (e) 
is very close to the default model of a uniform spectrum. 
The reconstructions in (a), o for (Ja and □ for at, are drawn 
closer to the signal mean as the lower bound on the data 
variance increases 



in this case, Figure ITOl (d). is now very close to the uni- 
form distribution given by the default model of the prior 
term — the previously resolved signal components regis- 
ter barely a ripple on an otherwise flat power spectrum. 
Probability theory has prevented us from overestimat- 
ing structure in the spectrum when the quality of the 
data is suspect. As the lower bound on the variance 
increases, the reconstruction from the model function 
grows closer to the mean of the data, as seen in Fig- 
ure [10] (a), such that the reconstructed signal's energy 
decreases from the value given by the original data, 
Eb < Ea < Ey, indicating that phase cancellations be- 
come more prominent in the spectrum. 



6 Application to a record of stellar luminosity 

Let us now apply the MaxEnt methodology to some 
real da ta. The signa l chosen here is a record of lumi- 
nosity (jHendenll2011l ) for the star VCas dating from the 
beginning of January, 2010, to the end of January, 2011. 
Some measurements are given with a temporal resolu- 
tion of seconds, but to keep the analysis tractable we 
assign the data to a daily time axis. When more than 
one measurement falls on the same day (a rare occur- 
rence) their mean and its variance are used. To reveal 
the low frequency content, the arithmetic mean is sub- 
tracted before conducting the spectral analysis. The 
choice of the arithmetic rather than the weighted mean 
is made because it is the arithmetic mean which appears 
in the frequency bin of the forward transform. 

The data span Nt — 390 days, as shown in Fig- 
ure [11] (a), and so a, Nf = 780 point frequency axis is 
used. The error bars are hard to see because they are 
small. The MaxEnt reconstruction also shown in panel 
(a) is driven towards the signal mean because of the 
oversampling of the frequency axis. The forward trans- 
form psd displayed in panel (b) shows a prominent low 
frequency peak as well as a few others with a period 
greater than 30 days. The transform is evaluated on 
a linear frequency axis but displayed on a logarithmic 
axis so that the low frequency region is more easily ob- 
served. The MaxEnt psd shown in (c) has not changed 
much, as expected from the small magnitude of the er- 
rors, but by incorporating the data variance and the 
amplitude measure, it provides a more conservative es- 
timate of the spectral density. The phase plots have 
been suppressed as very little variation is seen between 
the algorithms for this data set. 

The locations of the four lowest frequency peaks are 
given in Table]!] as is their ratio to the lowest frequency. 
These frequency peaks do not appear to be in a har- 
monic relation with the fundamental frequency of the 
signal. Note that the MaxEnt algorithm does not alter 
the locations of the peaks but does broaden their widths 
so that the variance of a frequency estimate increases 
with the noise level of the data. For this particular 
data set, a single or few frequency model (jBretthorst 
19881 ) might be more appropriate; however, this signal 
was chosen simply as an example of the type of data to 
which the MaxEnt psd algorithm for irregular sampling 
is applicable. 



7 Variance of the psd 

In this section we present some recent thoughts on how 
the methodology might be improved so that an esti- 
mate of the variance of the model parameters may be 
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Fig. 11 The MaxEnt reconstruction of tlie VCas luminos- 
ity signal in (a) is driven to the signal mean between the 
measurement times by the oversampling of the frequency 
axis. The time axis is given in units of days, and the fre- 
quency axis in units of cycles per day. The forward trans- 
form psd in (b) displays low frequency structure which be- 
comes more apparent on a logarithmic frequency axis, where 
the locations of the four lowest frequency peaks are indi- 
cated by dashed lines. The MaxEnt psd in (c) is drawn 
only slightly towards a flat spectrum because of the small 
magnitude of the data variance. The phase plots have been 
suppressed as little variation is seen between the forward 
transform and MaxEnt spectra 



obtained. The difficulty with the assignment of errors in 
the method described so far results from the appearance 
of the constraint term AC in the Lagrangian, so that 
the stationary point for the constrained optimum is at 
a saddle point in the extended parameter space (A, X). 
Obviously, the solution is to devise a methodology in 
which no constraint appears, so that the Hessian H of 
the merit function F at its optimal value G(X^) = 
provides an estimate of the variance of the model pa- 



Table 1 Peak locations in the power spectral density of 
the VCas data in units of days 





first 


second 


third 


fourth 


period 


220.82 


102.11 


64.67 


36.98 


freq. 


0.004529 


0.009793 


0.015463 


0.027038 


ratio 


1 


2.163 


3.415 


5.971 



rameters. As the normalization of the constraint C is 
given by the signal energy Ey, that is where we will 
look for improvements. 

Rather than constrain the spectral energy Ex. to the 
value given by the sum of squared data values Ey in 
Equation ([6]), what we need to do is evaluate the prob- 
ability of Ex for any set of coefficients given the data 
y and its variance V. Let us rewrite the normalized 
signal energy in terms of the expectation value of the 
squared data values. 



Ey/NdAt = {yt)w = Wy/TvW , 



(35) 



here using the unnormalized weight matrix W = V ^ . 
Simply assigning that value a variance given by 



{{y^ ~ {y!)^v)')^v = {yf)w - (y^) 



(36) 



turns out to be a bad idea, as the distribution for the 
energy p{Ey\y, V) is nothing like a Gaussian. 

As Ey and {yt)w differ only by a scaling factor 
N(iAt, let us focus our attention on the distribution 
p{y'^\yj^)- Considering some signal y, first suppose 
that its variance is proportional to the identity ma- 
trix V — a^I. The individual datum distributions for 
k e [1, A'd] are then given by normalized Gaussians, 



p{Vk\yu<y^) = (27ra2)-i/2exp-i/2 [(y, _ y^f/^^ 



(37) 



which need to be folded over to a strictly positive axis 
Zfe = l^fcl by writing 



p(zfe|?/t,CT^) (X exp [(zfe - j/t)^/o-^ 



■ exp 



(38) 
(39) 



That distribution is marginalized over k so that 
p{z\y,a^)=Y,P{zk\yu<J^)/Nd (40) 



may then be scaled for y^ — to yield 
p{y^\y,a^)=p{z\y,o^)\dzldy^\ , 



(41) 



which gives the distribution for the estimate of the sig- 
nal energy given the data and its variance. For the 
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Fig. 12 The logarithm of the distribution p{y^\a^) is 
shown for cr^ = 10"^ in (a), 10"^ in (b), 10"^ in (c), and 
10"* in (d), where the expectation value of is indicated 
by a dashed line. The expectation value as a function of 
is shown in (e), where the arithmetic mean of the squared 
data values is indicated by a dashed line 



signal y in Figure [U let us evaluate p(y^|y, cr^) for 
fj^ e [10-^,10^1]. In Figure [H panels (a) through 

(d) we show the logarithm of that distribution for the 
four values of equal to powers of 10, and in panel 

(e) we show the expectation value / y^p{y^\y,a^)dy^ 
as a function of a^. As cr^ — > 0, the distribution in 
approaches a sum of delta distributions located at the 
values of y^, and the expectation value approaches the 
arithmetic mean of the squared data values. 

If the variance matrix for the data is not propor- 
tional to the identity, one has to account for the covari- 
ance (off-diagonal elements) when evaluating the dis- 
tribution for the signal energy. To orthogonalize the 
data, one must decompose the weight matrix W into 
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Fig. 13 The datum distributions indexed by k in panel 
(a) with independent unit variance have a distribution for 
normalized energy shown in (b). Assuming the data vari- 
ance is given by a symmetric Toeplitz matrix with the same 
trace, the orthogonalized data in (c) have the distribution 
for normalized energy shown in (d), which has the same 
expectation value for 



its eigenvalues W and eigenvectors E, so that 

y'^T^y y^E^W'Ey = y'^W'y' (42) 

yields the same expectation value for y^. The orthogo- 
nalized datum distributions 



p{yk\y't,w',) = (27rM)-i/2exp-i/2 [{y,-y'^fwQ 



(43) 



where u;^ = W^j,, are then folded, marginalized, and 



scaled as before to yield p{y'^\y' ,W'). For the given 
signal y, we first test the method for V = I with trace 
Nd, with the results shown in Figure [T51 (a) and (b). 
In this case y' — y, and p{y^\y' ,W') is the same as 
p(y^|y, — 1) from the previous paragraph. Now 
we suppose the data variance is given by a symmet- 
ric Toeplitz matrix Vjk = l/(l-l-|j — A:|) with the same 
trace N^- The orthogonalized data y' shown in panel 
(c) have a distribution p{y'^\y' , W) shown in (d) with 
the same expectation value / y'^p{y'^\y' , W')dy^ as be- 
fore. The orthogonalization of the data does not affect 
the expectation value of the normalized energy distri- 
bution but does affect its shape according to the inde- 
pendent datum distributions. 
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To make contact with our goal, we must now idcn- 9 Appendix 



tify y with the normahzed spectral energy Ex/NdAt 
given by the amplitude coefficients A. The merit func- 
tion FjiPQ = R + P + Q now includes a term Q = 
— logp{Ex\y, V) for the distribution of the spectral en- 
ergy and does not include any explicit constraint. The 
expression Ex. is not a "hidden" variable but rather an 
auxiliary variable defined in terms of the model param- 
eters. The term Q is in effect (the negative logarithm 
of) an additional prior factor in the amplitude space 
A which accounts for the distribution of the spectral 
energy given the data and its variance. Furthermore, 
the factors Ey and ^ must now be rewritten in terms of 
Ex where they appear in the Poisson prior P of Equa- 
tion p6p . The implementation of these improvements 
is currently under investigation; we have outlined only 
one possible approach here, and some variation might 
be found to work better in practice. 



8 Conclusions 

In this article we have applied the principle of maximum 
entropy to power spectral density estimation using the 
one-sided Fourier transform in the context of discrete, 
irregular signal sampling. We have dispensed with the 
arbitrary weighting of the entropy term in the merit 
function in favor of a constraint on the spectral energy. 
The prior is rewritten in terms of the continuous Pois- 
son distribution whose Stirling approximation gives the 
familiar entropy expression for unnormalized distribu- 
tions. In the limit of vanishing errors, the spectrum 
with maximum evidence is equal to that with maximum 
likelihood given by the forward transform coefficients, 
and in the limit of extreme errors it approaches a flat 
power spectrum with the same energy as the signal. 
An outline of improvements to the method to obtain 
the variance of the spectral coefficients is also given. 

As an example, we have evaluated the power spec- 
trum of an irregularly sampled record of stellar luminos- 
ity for the star VCas. Several prominent peaks in the 
spectrum survive the smoothing action of the MaxEnt 
algorithm, which prevents the overestimation of struc- 
ture in the spectrum when confronted with imperfect 
data. In that sense, the MaxEnt algorithm gives a more 
conservative estimate for the power spectrum than the 
forward transform. As actual measurements necessarily 
are accompanied by measurement errors, the incorpo- 
ration of their effect through the principle of maximum 
entropy is suggested to those who use the Fourier trans- 
form for power spectral density estimation. 



In deriving the usual entropy expression, recourse is 
made to the discrete Poisson distribution 



P 



(44) 



with parameter fi for integer fc > 0, where the propor- 
tionality is given by the normalization X]fc°=o A**^/^' — 
e^. For ^ and k in some units except for the expo- 
nent, there are k unit factors in the numerator which 
cancel k unit fac tors in the deno minator. While it has 
been suggested ( Marsaglia 19861 ) that the cumulative 
distribution is what should be generalized to the con- 
tinuum, it seems more intuitive that the probability 
density be generalized through the obvious substitution 
fc! — >■ r(n + 1) for continuous n > 0, giving a contin- 
uous Poisson density oc /i"/r(rt + 1). About the 
only objection one could raise for such a density func- 
tion is one of normalization over an infinite axis, which 
itself is mooted by consideration of some finite cutoff 
J^max larger than any scale of interest in the problem 
at hand. The use of an unnormalizable distribution is 
implicit in the maximum likelihood method, as the uni- 
form distribution is only finite when considered over a 
restricted domain. 

As almost any non-negative function with finite do- 
main can be normalized to a probability density, let us 
consider the normalization of the continuous Poisson 
distribution over an infinite axis. Can we show that 



1) 



Taking the derivative with 



respect to the parameter /i of both sides, the RHS is 
the definition of the exponential function, df^e^ = e^. 
The derivative of the LHS leads to the expression 



a, 



dn 



dn 



°° //"-I dn 



" 'o r(n+l) - J, r(n + l) J, Tin) 
whereupon shifting the limits down by one unit. 



T{n + 1) 







^" dn 



jjL^'' dn 



iT{n + l) 7o r(7i + l) 
"° ^" dn 
r(n+l) ' 



(45) 

(46) 
(47) 



where the last step follows from the observation that 
our original object, the probability density, is zero over 
the negative axis, = for n < 0. We hesitate to 
call this example a proof of the relation, as the most 
formal of mathematicians might object to the heuristic 
final step, but it is certainly highly suggestive that the 
normalization carries over to the continuum unaltered. 
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